Quantitative structure–activity relationship modeling for prediction of inhibition potencies of imatinib derivatives using SMILES attributes

Chronic myelogenous leukemia (CML), which results from the BCR-ABL tyrosine kinase (TK) chimeric oncoprotein, is a malignant clonal disorder of hematopoietic stem cells. Imatinib is used as an inhibitor of BCR-ABL TK in the treatment of CML patients. The main objective of the present work is to construct quantitative structure–activity relationship (QSAR) models for the prediction of inhibition potencies of a large series of imatinib derivatives against BCR-ABL TK. Herein, the inbuilt Monte Carlo algorithm of CORAL software is employed to develop QSAR models. The SMILES notations of chemical structures are used to compute the descriptor of correlation weights (DCW). QSAR models are established using the balance of correlation method with the index of ideality of correlation (IIC). The data set of 306 molecules is randomly divided into three splits. In QSAR modeling, the numerical values of R², Q², and IIC for the validation set of splits 1 to 3 are in the ranges 0.7180–0.7755, 0.6891–0.7561, and 0.4431–0.8611, respectively. The result CR²p > 0.5 for all three constructed models in the Y-randomization test validates the reliability of the established models. The promoters of increase/decrease of pIC50 are identified and used for the mechanistic interpretation of structural attributes.

of optimal descriptors of correlation weights (DCW), and the construction of predictive models using the physicochemical conditions of corresponding experiments are unique options available in the CORAL software for the development of QSAR models [16–22]. A literature survey shows that the index of ideality of correlation (IIC) has been applied to improve the statistical quality of QSAR models [23–28]. In addition, most descriptors used in common QSAR models do not have a physical meaning and cannot be associated with a mechanistic interpretation. By contrast, QSAR models built with CORAL software use SMILES-based molecular descriptors that allow mechanistic interpretation and can be associated with molecular fragments. The objective of the present work is to apply the inbuilt Monte Carlo algorithm of CORAL software to build QSAR models that predict the inhibition potencies (pIC50) of 306 imatinib derivatives against BCR-ABL tyrosine kinase (TK). The balance of correlation method with IIC is used to develop the QSAR models. The reliability and predictability of the designed QSAR models are assessed using three random splits.

Method
Data. Zin et al. [29] extracted the inhibition potential of 306 compounds for the human BCR-ABL tyrosine kinase from the ChEMBL v23 (2017) database [30]. The inhibition potential was expressed as the half maximal inhibitory concentration in mol/L (IC50) and transformed to its negative logarithm (pIC50). The endpoint pIC50 was taken as the dependent variable for constructing the QSAR models and ranged from 4.03 to 9.37. Three splits were created from the dataset (n = 306), and the compounds of each split were randomly divided into training (34%), invisible training (35%), calibration (15%) and validation (16%) sets. The SMILES notations, split distribution, experimental pIC50, predicted pIC50, and applicability domain of each compound are given in Table S1. The role of each set in developing the QSAR models has been described in the literature [31,32].
Optimal SMILES-based descriptors. In the CORAL software, three types of optimal descriptors, i.e. SMILES-based, graph-based and hybrid descriptors (a combination of SMILES and graph), can be employed to develop QSAR models.
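The endpoint transformation and the four-way random split described under "Data" can be illustrated with a minimal Python sketch. It assumes a base-10 logarithm for pIC50 and simple rounding of the set sizes; the function names, the seed, and the rule that the remainder goes to the validation set are hypothetical choices of this sketch, not details of the CORAL workflow.

```python
import math
import random

def pic50_from_ic50(ic50_molar):
    """Negative base-10 logarithm of an IC50 value given in mol/L."""
    return -math.log10(ic50_molar)

def split_indices(n, fractions=(0.34, 0.35, 0.15, 0.16), seed=0):
    """Shuffle the indices 0..n-1 and cut them into four consecutive subsets."""
    names = ("training", "invisible_training", "calibration", "validation")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    # Round the first three set sizes; the last set takes whatever is left.
    sizes = [round(f * n) for f in fractions[:-1]]
    sizes.append(n - sum(sizes))
    subsets, start = {}, 0
    for name, size in zip(names, sizes):
        subsets[name] = order[start:start + size]
        start += size
    return subsets

# Three independent splits of the 306 compounds (seeds are arbitrary).
splits = [split_indices(306, seed=s) for s in (1, 2, 3)]
```

Using a different seed for each split gives three independent partitions of the same compounds, which is the role the three splits play in the text.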
The optimal descriptor is a mathematical function of so-called correlation weights (CW). Correlation weights are numerical coefficients associated with various molecular features extracted from SMILES symbols. In other words, the univariate models investigated in this research are based on the "descriptors of correlation weights" (DCW). The Monte Carlo algorithm was used to calculate the DCW. In the present research, the SMILES-based descriptor was employed to build the QSAR models. The optimal descriptors used to build the pIC50 models are calculated as follows:
Here, T is the threshold and N is the number of epochs. T is an integer used to split SMILES attributes (i.e. Sk, SSk, and SSSk) into two classes, active and rare. If a molecular attribute A occurs fewer than T times in the SMILES of the training set, it is classified as rare and omitted from the construction of the model, i.e. its correlation weight is CW(A) = 0. T* and N* are the values of T and N that yield the best statistical result of a model for the calibration set.
The notation in Eq. (2) is as follows: SSSk, a local SMILES attribute, is a combination of three SMILES atoms; NOSP, HALO, and BOND are global SMILES attributes: NOSP indicates the presence or absence of nitrogen (N), oxygen (O), sulfur (S), and phosphorus (P); HALO indicates the presence or absence of fluorine, chlorine, and bromine; BOND indicates the presence or absence of double ('='), triple ('#') and stereochemical ('@' or '@@') bonds; PAIR denotes the combination of BOND and NOSP; HARD denotes the presence or absence of NOSP, HALO, and BOND; Cmax represents the maximum number of rings; Nmax and Omax are the total numbers of nitrogen and oxygen atoms in the molecular structure. CW(A) denotes the correlation weight of a SMILES attribute, e.g. SSSk, NOSP, BOND, HALO, PAIR, Cmax, Nmax, and Omax. These correlation weights are calculated using Monte Carlo optimization [33–37].
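To make the roles of the threshold T and the correlation weights concrete, the following Python sketch computes a simplified DCW from the local SSSk attributes only. It is an illustration under stated assumptions: the tokenizer is a rough approximation of SMILES atoms, the global attributes (BOND, NOSP, HALO, PAIR, HARD, Cmax, Nmax, Omax) are left out, an attribute's frequency is counted as its total number of occurrences in the training SMILES, and the weights are passed in as a plain dictionary rather than optimized. None of the function names come from CORAL.

```python
import re
from collections import Counter

# Rough SMILES tokenizer: bracket atoms, two-letter halogens, ring-closure
# labels written as %nn, otherwise any single character.
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def smiles_atoms(smiles):
    """Split a SMILES string into 'SMILES atoms' (simplified)."""
    return _TOKEN.findall(smiles)

def sss_attributes(smiles):
    """Sliding windows of three consecutive SMILES atoms (SSS_k)."""
    atoms = smiles_atoms(smiles)
    return ["".join(atoms[i:i + 3]) for i in range(len(atoms) - 2)]

def active_attributes(training_smiles, threshold):
    """Attributes occurring at least `threshold` times in the training set."""
    counts = Counter(a for s in training_smiles for a in sss_attributes(s))
    return {a for a, n in counts.items() if n >= threshold}

def dcw(smiles, weights, active):
    """Sum of correlation weights; rare or unknown attributes contribute 0."""
    return sum(weights.get(a, 0.0) for a in sss_attributes(smiles) if a in active)

# Toy usage with made-up weights (hypothetical values, not fitted).
train = ["CC(=O)O", "CC(=O)N"]
active = active_attributes(train, threshold=2)
weights = {"CC(": 0.5, "C(=": -0.25, "O)O": 3.0}
value = dcw("CC(=O)O", weights, active)
```

In the toy example the attribute "O)O" occurs only once in the training set, so with T = 2 it is rare and its weight is ignored.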
The numerical values of the descriptor, $\mathrm{DCW}(T^{*}, N^{*}) = {}^{\mathrm{SMILES}}\mathrm{DCW}(T^{*}, N^{*})$ (Eq. 1), are used to determine the inhibition potential of imatinib derivatives (pIC50) by the least squares method using the following one-variable model:
Monte Carlo optimization. In the present research, the modified target function (TFm), i.e. the balance of correlation with IIC, was employed to compute the DCW [32]. The following mathematical relationships are used to compute TFm:
Here, R_training and R_invTraining are the correlation coefficients for the training and invisible training sets, respectively. The empirical constant (Const) is usually fixed. The index of ideality of correlation for the calibration set (IIC_CAL) is calculated using the following equation:
Here, k is the index (1, 2, …, N), and observed_k and calculated_k are the observed and calculated values of the endpoint.
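The quantities named in this section can be sketched in Python as follows. The forms used here are assumptions based on how these quantities are commonly written in the CORAL literature: a linear one-variable model pIC50 = C0 + C1 × DCW, a target function TF = R_training + R_invTraining − |R_training − R_invTraining| × (dR weight), a modified target function TFm = TF + IIC_CAL × (IIC weight), and IIC = r × min(MAE−, MAE+) / max(MAE−, MAE+), where MAE− and MAE+ are the mean absolute errors over negative and non-negative residuals (observed minus calculated). The names C0, C1 and all function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(dcw, endpoint):
    """Least-squares intercept C0 and slope C1 of endpoint = C0 + C1 * DCW."""
    n = len(dcw)
    mx, my = sum(dcw) / n, sum(endpoint) / n
    c1 = (sum((a - mx) * (b - my) for a, b in zip(dcw, endpoint))
          / sum((a - mx) ** 2 for a in dcw))
    return my - c1 * mx, c1

def iic(observed, calculated):
    """Assumed IIC: r * min(MAE-, MAE+) / max(MAE-, MAE+)."""
    deltas = [o - c for o, c in zip(observed, calculated)]
    neg = [-d for d in deltas if d < 0]
    pos = [d for d in deltas if d >= 0]
    if not neg or not pos:
        return 0.0  # one-sided residuals: a convention of this sketch only
    mae_neg, mae_pos = sum(neg) / len(neg), sum(pos) / len(pos)
    r = pearson_r(observed, calculated)
    return r * min(mae_neg, mae_pos) / max(mae_neg, mae_pos)

def modified_target_function(r_train, r_inv_train, iic_cal, dr_weight, iic_weight):
    """Assumed TFm: balance of correlations plus the weighted IIC term."""
    tf = r_train + r_inv_train - abs(r_train - r_inv_train) * dr_weight
    return tf + iic_cal * iic_weight
```

Because the ratio of the two mean absolute errors is at most 1, the IIC penalizes models whose residuals are lopsided even when the correlation coefficient itself is high.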
Applicability domain. According to the third OECD principle, a defined applicability domain (AD) is recommended for the validation of an established QSAR model. The AD is the physicochemical, structural, or biological space, knowledge, or information on which the training set of the model was built and within which the model can be used to make predictions for new compounds [38,39].
In Monte Carlo-based QSAR with the CORAL program, the distribution of SMILES attributes in the training, invisible training and calibration sets is used to define the AD [40,41]. If a substance falls outside the AD, it is identified as an outlier and its prediction cannot be considered reliable.
In CORAL, a compound is considered to be within the AD if the following inequality is fulfilled; otherwise, it is regarded as an outlier:
where Defect_TRN is the average statistical defect (D) over the compounds of the training set.
The statistical defect (D) can be described as the sum of statistical defects of all attributes present in the SMILES notation.
NA is the number of active SMILES attributes of the given compound.
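As a rough illustration of the applicability-domain check, the Python sketch below assumes the form commonly reported for CORAL: the defect of an attribute is the absolute difference between its frequencies in the training and calibration sets divided by its total count in both sets, the defect of a compound is the sum over its attributes, and a compound lies inside the AD when its defect is below twice the average training-set defect. The factor of 2, the defect of 1 for unseen attributes, the use of only two sets, and all function names are assumptions of this sketch.

```python
def attribute_defect(n_trn, n_cal, size_trn, size_cal):
    """Defect(A) from the counts of SMILES containing attribute A in each set."""
    if n_trn + n_cal == 0:
        return 1.0  # attribute never seen: maximal defect (assumption)
    p_trn = n_trn / size_trn   # frequency of A in the training set
    p_cal = n_cal / size_cal   # frequency of A in the calibration set
    return abs(p_trn - p_cal) / (n_trn + n_cal)

def molecule_defect(attributes, defect_table):
    """Sum of the defects of all attributes found in one SMILES string."""
    return sum(defect_table.get(a, 1.0) for a in attributes)

def in_domain(defect, training_defects, factor=2.0):
    """True if the defect is below `factor` times the mean training defect."""
    mean_trn = sum(training_defects) / len(training_defects)
    return defect < factor * mean_trn

# Toy usage with invented counts and defects.
table = {"CC(": attribute_defect(10, 5, 100, 50),
         "C(=": attribute_defect(10, 0, 100, 50)}
d = molecule_defect(["CC(", "C(=", "xyz"], table)
```

An attribute that is equally frequent in both sets contributes nothing, while an attribute absent from the table contributes the maximal defect of 1, which quickly pushes unusual compounds outside the domain.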
The "statistical defect" Defect(A) of a SMILES attribute is defined by the following equation:
Validation of the model. The statistical quality of the QSAR models for pIC50 of imatinib derivatives is evaluated using three approaches: (i) internal validation or cross-validation by determining R², IIC, CCC, Q², and the F-test on the training set; (ii) external validation by determining Q²F1, Q²F2, Q²F3, CR²p, s, MAE, r̄²m, and Δr²m using the test set substances; and (iii) data randomization or Y-scrambling (Table 1). The mathematical definitions of these statistical parameters are given in the literature [42–46]. In Table 1, Yobs is the observed endpoint; Yprd is the predicted endpoint; R² and R²0 are the squared correlation coefficients between the observed and predicted endpoints with and without intercept, respectively; and R²r is the squared mean correlation coefficient of the randomized models.
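Several of the external validation metrics listed above have widely used closed forms, which the following Python sketch implements for illustration. The definitions are the standard ones from the QSAR validation literature and are assumed, not confirmed, to match Table 1; the function names are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)

def _press(obs, prd):
    """Sum of squared prediction errors."""
    return sum((o - p) ** 2 for o, p in zip(obs, prd))

def q2_f1(obs_ext, prd_ext, obs_train):
    """Reference sum of squares taken around the training-set mean."""
    m = _mean(obs_train)
    return 1 - _press(obs_ext, prd_ext) / sum((o - m) ** 2 for o in obs_ext)

def q2_f2(obs_ext, prd_ext):
    """Reference sum of squares taken around the external-set mean."""
    m = _mean(obs_ext)
    return 1 - _press(obs_ext, prd_ext) / sum((o - m) ** 2 for o in obs_ext)

def q2_f3(obs_ext, prd_ext, obs_train):
    """Mean squared external error scaled by the training-set variance."""
    m = _mean(obs_train)
    tss_train = sum((o - m) ** 2 for o in obs_train)
    return 1 - (_press(obs_ext, prd_ext) / len(obs_ext)) / (tss_train / len(obs_train))

def ccc(obs, prd):
    """Concordance correlation coefficient."""
    mo, mp = _mean(obs), _mean(prd)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, prd))
    so = sum((o - mo) ** 2 for o in obs)
    sp = sum((p - mp) ** 2 for p in prd)
    return 2 * cov / (so + sp + len(obs) * (mo - mp) ** 2)

def mae(obs, prd):
    """Mean absolute error."""
    return _mean([abs(o - p) for o, p in zip(obs, prd)])
```

The three Q²F variants differ only in the reference variance against which the external prediction error is compared, which is why they can disagree when the external set covers a narrower endpoint range than the training set.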

Results and discussion
QSAR models. From the data described in "Data", three splits were generated randomly. Each split was further divided into four sets, namely the training, invisible training, calibration and validation sets. To establish the QSAR models, the balance of correlation with the IIC was employed. The values of the IIC weight and the dR weight (the weight for dR in the balance of correlations) were 0.2 and 0.1, respectively. The preferred values of T* and N* were 1 and 15 for all splits. With these values of T* and N*, the endpoint pIC50 was computed for each split, and the developed QSAR models are as follows:
The statistical characteristics of the QSAR models given by Eqs. (13)–(15) are listed in Table 2. The results in Table 2 show that all generated QSAR models are statistically sound and meet the requirements of various validation criteria. The robustness of the established QSAR models was demonstrated by the values of R² and Q², which were more than 0.5 and 0.7 [47,48]. In addition, the value of the R²m metric for the validation set of all designed QSAR models was satisfactory and meets the criteria suggested by Roy et al. [49]. The scaled r̄²m and Δr²m metrics, introduced by Roy et al. as modified R²m metrics, were also computed [50]; these values were 0.6928 and 0.0216, 0.6878 and 0.0929, and 0.7339 and 0.1230 for splits 1 to 3, respectively. The trustworthiness of the constructed QSAR models was also confirmed by the Y-randomization test.
After several repetitions, new random models were developed, and their R² values were found to be below 0.1 (see Table S2 in the supplementary information). This result indicates that the correlation between pIC50 and the molecular attributes is not due to chance. Moreover, for all three splits, CR²p was greater than 0.75, which confirms the non-chance correlation of the developed models [51].
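The logic of the Y-randomization test can be illustrated with a short Python sketch. It assumes the usual definition CR²p = R × sqrt(R² − R²r) and, as a simplification, keeps the descriptor values fixed while shuffling the endpoint; in an actual CORAL workflow the Monte Carlo optimization would be repeated on each scrambled data set. Averaging the R² values of the shuffled models to obtain R²r, clamping a negative difference to zero, and the function names are further assumptions of this sketch.

```python
import math
import random

def r_squared(x, y):
    """Squared Pearson correlation between descriptor and endpoint."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def mean_random_r2(x, y, n_runs=100, seed=0):
    """Average R^2 obtained after repeatedly shuffling the endpoint values."""
    rng = random.Random(seed)
    shuffled = list(y)
    total = 0.0
    for _ in range(n_runs):
        rng.shuffle(shuffled)
        total += r_squared(x, shuffled)
    return total / n_runs

def crp2(r2_model, r2_random):
    """Assumed CR2p = R * sqrt(R^2 - R_r^2); negative differences clamp to 0."""
    return math.sqrt(r2_model) * math.sqrt(max(r2_model - r2_random, 0.0))

# Toy usage on synthetic, perfectly correlated data.
x = [float(i) for i in range(30)]
y = [2.0 * v + 1.0 for v in x]
r2_rand = mean_random_r2(x, y)
score = crp2(r_squared(x, y), r2_rand)
```

A model whose R² barely exceeds the R² of the scrambled models receives a CR²p close to zero, which is why a threshold such as 0.5 is used to rule out chance correlation.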

Table 1. Columns: Type of validation; Criterion of the predictive potential.
The AD of each compound in models 1 to 3 is shown in Table S1, based on the defect values. The percentages of compounds within the AD were 81, 83, and 87% for splits 1–3, respectively, which shows that the three models could provide reliable predictions for more than 80% of the compounds. Figures 1 and 2 show plots of experimental versus predicted pIC50 and of residual versus predicted pIC50 for the three models. As can be seen in Fig. 1, there is good agreement between the experimental and predicted data in the suggested models. Fig. 2 shows that the residuals are scattered around the horizontal zero line. These results confirm that all constructed QSAR models are robust and well fitted.
Interpretation of the QSAR model. Mechanistic interpretation of a model helps in understanding how the descriptors affect the predicted endpoint. For QSAR models built with the CORAL program, the mechanistic interpretation is based on the correlation weights (CW) of SMILES attributes obtained from several runs of the Monte Carlo optimization. Across the runs of a model, the CW of a SMILES attribute may be positive in all runs, negative in all runs, or positive in some runs and negative in others. Attributes with consistently positive CW are promoters of increase of the endpoint, and attributes with consistently negative CW are promoters of decrease. Consequently, promoters of increase of pIC50 have positive CW and promoters of decrease of pIC50 have negative CW. If a structural attribute has both positive and negative CW values across the runs, it is considered undefined. Table 3 lists the structural features identified as promoters of increase or decrease of pIC50 in three runs of the Monte Carlo optimization with the optimum T* and N*, along with the interpretation of the promoters (NT is the number of attributes in the training set, NiT is the number of attributes in the invisible training set, and NC is the number of attributes in the calibration set). In this way, the important SMILES descriptors acting as promoters of increase/decrease of pIC50 were identified.
Comparison of the present QSAR models with the previous study shows that the previous model required the structures, physicochemical parameters or prior calculation of chemical descriptors, whereas with CORAL software only a text file containing the SMILES notations of the compounds and the endpoint values was needed for model development.
Here, three splits were used to establish three QSAR models with four sets each (training, invisible training, calibration and validation), whereas the previously constructed models used a single split with two sets (training and test). In the present research, the molecular features responsible for the increase/decrease of the endpoint were also identified for mechanistic interpretation.
In terms of statistical quality, the QSAR models proposed here using CORAL for the prediction of pIC50 were superior to the previously reported model. The statistical parameters Q²F1, Q²F2, Q²F3, CR²p, CCC and IIC were not given in the previous report. The R² values of the training and validation sets for splits 1 to 3 are in the ranges 0.76–0.85 and 0.71–0.78, respectively, and the MAE values of the training and validation sets are in the ranges 0.41–0.54 and 0.46–0.54, respectively. Therefore, the QSAR models established here are more reliable and have better predictability.

Conclusion
In this work, QSAR models were built with the Monte Carlo method to predict the pIC50 of 306 imatinib derivatives and validated with several parameters. The QSAR models were established using a modified target function (TFm). The statistical quality of the constructed models was assessed using internal and external validation metrics such as R², IIC, CCC, Q², Q²F1, Q²F2, Q²F3, F, s, MAE, RMSE, r̄²m, Δr²m, scaled r̄²m, scaled Δr²m, CR²p, and the Y-randomization test. For the validation sets of splits 1 to 3, the values of R², Q², and IIC were in the ranges 0.7180–0.7755, 0.6891–0.7561, and 0.4431–0.8611, respectively. The applicability domain (AD) was used to identify outliers in the generated QSAR models. The structural features acting as promoters of pIC50 increase/decrease were also identified.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.